The Byzantine Generals Problem
Understand consensus when nodes can exhibit arbitrary behavior.
We'll cover the following
At this point, we know that if a network can lose messages, the consensus is not guaranteed; or if a network can arbitrarily delay messages or nodes have arbitrarily long pauses, then a single crash failure can hinder achieving consensus. In this lesson, we’ll see a scenario where we assume that the network is non-faulty, but nodes can have serious (and hard to detect) Byzantine failures. We want to see if consensus is possible when Byzantine failures are possible in a non-faulty network.
Problem setup#
The Byzantine general’s problem has multiple generals (three shown in the picture below) who want to generate consensus about attacking a city. The challenge here is that some number of generals are traitors who can act maliciously (for example conveying different things to different generals). The goal is that non-Byzatine general reach a common decision. As in the two-general problem, the messages are conveyed by messengers. Contrary to the two-general setup, we assume messages get delivered reliably and are not lost.
Let's understand the challenges using the following two scenarios. In the first scenario, let's assume that General 2 is a traitor. General 1 sends attack messages to all. General 2 lies to General 3 that General 1 said to retreat. Now General 3 has two messages, saying conflicting things. To see why it is not easy for General 3 to decide which General is a traitor, let's see the second scenario, which from General 3's perspective, is indistinguishable from the first scenario. Here, General 1 is the traitor. So General 3 can't decide whether General 1 is the traitor or General 2. This makes us think that consensus is probably impossible in this specific case.
Leslie Lamport, Robert Shostak, and Marchall Pease provided that we need at least
There are different variants of Byzantine Generals Problem with varying setups. If we use digital signatures and assume that Byzantine nodes don't collude with each other, then a simple majority of nodes is enough to reach consensus amongst non-faulty nodes, possibly using a leader. So
Point to ponder
Question 2
In real systems, how can the bound be practically beneficial if we can’t always detect Byzantine nodes?
For a real system, there are multiple strategies that can be employed.
A system can strive to increase the number of non-faulty nodes in a system at all times to ensure that at least two-third of the total number of nodes are non-faulty. For such a system to work, we need to start from a known good state and never let the system drift to a bad state. Many blockchain systems are an example of such a scenario, where faulty nodes are actively detected and isolated before they become a bigger issue.
Some protocols use a leader-based approach where nodes only communicate with the leader and not among themselves. Doing so, we might be able to use bound, where a leader takes the majority vote.
Often, specific instances of a problem allow us to use special detectors of faults. For example, we could have a vantage point in a system where we could see (and later verify if need be) to get a holistic picture of who is doing what and what messages are exchanged. However, achieving this in a bigger system will be hard in terms of performance.
2 of 2
A real application#
Let's take the example of an online shop where three parties are involved—the customer, the online shop, and the payment service. All of them have less than entirely trusting relationships with each other. However, all of them also need to be on the same page about an order and related charges. How do we solve such a problem? Here, whatever is involved in an order is digitally signed, so we can reliably determine who said what, and no one can back off. Additionally, if things go wrong, we can use the audit trail (the equivalent of a log of who did what) and entity reputation mechanisms to solve them. This is an example that shows how one can go about solving Byzantine-like problems in specific scenarios.
With that, our study of fundamental consensus results is complete.
Conclusion#
We studied three fundamental consensus results—the Two Generals’ Problem, FLP impossibility, and the Byzantine Generals problem. These examples showed us how and when consensus is possible and what makes consensus impossible. All these results are established based on certain assumptions and associated guarantees. Consensus is such an important problem that we must strive to find ways out of impossibility results. Such compromise often comes by fiddling either with the assumptions or system guarantees. We will provide examples of such instances in the two-phase commit (2PC), Paxos, and Raft chapters. The purpose of this chapter was to establish a common terminology and a framework to understand the consensus difficulties and possible ways out. For example, we’re in a much better position to understand why the Paxos algorithm can stop making progress at times instead of just accepting it as some quirk of the algorithm.
Four Consensus Scenarios
Consensus Scenarios | Computational Model | Consensus Always Possible? |
Two Generals' Problem | Synchronous + Non-faulty nodes + Network can lose messages | No |
FL82 | Synchronous + Crash failures | Yes |
FLP (FLP85) | Asynchronous + Non-faulty network + Crash failures | No |
Byzantine Generals Problem | Partially synchronous + Byzantine failures | Yes, if in 3f+1 nodes, at most f are faulty |
FLP Impossibility
Quiz on Consensus Fundamentals